Abstract

The Intel® Core™ i7 processor code named Nehalem 
provides a feature named Turbo Boost which opportunis- 
tically varies the frequencies of the processor's cores. 
The frequency of a core is determined by core temper- 
ature, the number of active cores, the estimated power 
consumption, the estimated current consumption, and 
operating system frequency scaling requests. For a chip 
multi-processor (CMP) that has a small number of physical
cores and a small set of performance states, deciding
the Turbo Boost frequency to use on a given core might 
not be difficult. However, we do not know the complex- 
ity of this decision making process in the context of a 
large number of cores, scaling to the 100s, as predicted 
by researchers in the field. 

1 Introduction 

In recent times, we have seen the introduction of processors
with multiple cores on chip, ranging from commodity
2-, 4-, and 8-core processors to the Polaris 80-core
research processor from Intel and Tilera's 64-core and
recent 100-core offerings. Given these trends and the
progression of Moore's Law, researchers are envisioning
processors with 100s if not 1000s of cores on a single
chip.

The architecture of such massively multicore chips 
poses new problems for scalability, both on hardware 
and software fronts, including performance, power consumption,
memory and cache bottlenecks, and software scalability
to multiple threads. Software written to take advantage of
multiple cores, that is, multithreaded software, is becoming
increasingly prevalent. Various threading libraries that hide the
complexities of multithreading are being released, such as Cilk
and Intel TBB. However, not all available software will scale to
use all the available cores on a multicore processor; legacy
single-threaded software might still be in use, as it is today.



Given the above usage scenario, we can expect that 
some of the cores on a CMP will be idle and such cores 
can be transitioned to low power states (C states) and the 
frequency of busy cores can be boosted to provide im- 
proved performance. This behaviour forms the basis of 
the Turbo Boost feature present in the Intel® Core™ i7 
processors (codenamed Nehalem) [5]. Turbo Boost is
made possible by a processor feature named power gat- 
ing. Traditionally, an idle processor core consumes little 
or no active power (which is due to transistor switch- 
ing activity) while still dissipating static power due to its 
leakage current, even when it is operating at its lowest 
frequency. Power gating aims to cut the leakage current 
as well, thereby further reducing the power consump- 
tion of the idle core. The extra power headroom avail- 
able can be diverted to the active cores to increase their 
voltage and frequency without violating the power, volt- 
age, and thermal envelopes. Turbo Boost is one way of 
providing performance boost to applications in a mas- 
sively multicore processor setup. In this paper, we make 
a case that the current Turbo Boost mechanism is not suf- 
ficiently scalable and present a theoretical formulation of 
the problem involving optimal frequency assignment. 

In the remainder of the section, we present how Op- 
erating System Power Management (OSPM), Dynamic 
Voltage and Frequency Switching (DVFS), and Turbo 
Boost interact. In Section 2, we present the results and
analysis of our experiments intended to understand the
behaviour of Turbo Boost. We present the optimal frequency
assignment problem formulation in Section 3 and
related work in Section 4. In the rest of the paper, we will
occasionally use the term Turbo instead of the full name,
Turbo Boost.

1.1 OSPM, DVFS & Turbo Boost 

Platform BIOS exports a P-State table that contains the
performance state information: a tuple containing the
voltage and frequency identifiers that need to be written
to hardware registers to change the operating frequency
of the processor core. The first entry in the table is referred
to as $P_0$ and the last as $P_n$; $P_0$ corresponds to the
highest frequency and $P_n$ to the lowest frequency at which the
processor core can operate. Table 1 and Table 2 capture
the frequencies reported by the BIOS with and without
Turbo Boost, respectively. These values were obtained
from the sysfs entries exported by Linux, running on
a quad-core Core™ i7 965 based machine. The Core
i7 processor cores can operate at frequencies from 1.6
GHz to 3.4 GHz in steps of 133.33 MHz. Of the operating
frequencies supported, one or more are marked as
Turbo frequencies, and the highest non-Turbo frequency
is termed the maximum guaranteed frequency ($P_{max}$). On
the Core i7, 3.3 and 3.4 GHz are configured as Turbo frequencies
and 3.2 GHz is $P_{max}$. The 3.3 and 3.4 GHz frequencies
are not reported directly to the OS. Instead, an operating
point at 3193000 kHz is reported. This frequency
is not supported by the processor and serves as the Turbo Boost
indicator.
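
As an illustration of the Turbo Boost indicator described above, the following minimal Python sketch parses a list of reported frequencies and checks whether the highest entry looks like the indicator. The 1000 kHz gap test is an assumption drawn from the two values reported here (3193000 kHz and 3192000 kHz); the function names are hypothetical, and the sysfs path in the comment is where the acpi-cpufreq driver typically exposes this list, which may differ on other systems.

```python
from typing import List, Optional

# Typical location with the acpi-cpufreq driver (assumption; not read here):
# /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies


def parse_available_frequencies(text: str) -> List[int]:
    """Parse whitespace-separated kHz values and return them highest first."""
    return sorted((int(token) for token in text.split()), reverse=True)


def find_turbo_indicator(freqs_khz: List[int]) -> Optional[int]:
    """Return the highest entry if it looks like the Turbo indicator.

    Assumption: the indicator sits exactly 1000 kHz above the maximum
    guaranteed frequency, as in the 3193000 / 3192000 kHz pair above.
    """
    if len(freqs_khz) >= 2 and freqs_khz[0] - freqs_khz[1] == 1000:
        return freqs_khz[0]
    return None


with_turbo = parse_available_frequencies("3193000 3192000 1596000")
without_turbo = parse_available_frequencies("3192000 1596000")
print(find_turbo_indicator(with_turbo))
print(find_turbo_indicator(without_turbo))
```

The two sample inputs contain only the endpoint values listed in Table 1 and Table 2; the intermediate operating points are omitted.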

Under high load conditions, OSPM requests a frequency
boost by writing the voltage and frequency identifiers
into the appropriate CPU registers for the particular
core. If the frequency requested is a Turbo Boost
frequency, then firmware within the processor decides
whether the operating frequency can be supported without
violating the power and thermal constraints (listed
earlier), and if so, boosts the frequency to Turbo; Turbo
Boost is an opportunistic feature. If the processor cannot
provide the Turbo frequency, it operates at the maximum
guaranteed frequency. Turbo Boost, therefore, is not
under the control of OSPM and would appear quite non-deterministic
from OSPM's perspective.



Table 1: Frequencies with Turbo

P-State    Frequency (in kHz)
$P_0$      3193000
$P_1$      3192000
...        ...
$P_n$      1596000



Table 2: Frequencies without Turbo

P-State    Frequency (in kHz)
$P_0$      3192000
...        ...
$P_n$      1596000



2 Deconstructing Turbo Boost 

The Turbo Boost algorithm is proprietary and we do not
have information about its internal workings. Therefore,
we run a set of experiments and gather data, which
we analyse to obtain information about the behaviour of
Turbo. We run the BLAST [1] benchmark, a suite of
multithreaded applications that show varying rates of
CPU utilization and consequently could be expected to
provide sufficient scope for Turbo Boost to engage. We
measured CPU utilization using mpstat, and measured
frequency using a frequency measurement tool developed
in-house. Frequency calculation was done by implementing
the algorithm provided in [6] (also summarized
below).

1. The base operating ratio is obtained by reading the
PLATFORM_INFO Model Specific Register (MSR).
The base operating ratio is multiplied by the bus
clock frequency (133.33 MHz) to obtain the base
operating frequency.

2. The Fixed Architectural Performance
Monitor counters are enabled. Fixed Counter 1
counts the number of core cycles while the core is
not in a halted state (CPU_CLK_UNHALTED.CORE).
Fixed Counter 2 counts the number of reference
cycles when the core is not in a halted state
(CPU_CLK_UNHALTED.REF).

3. The two counters are read at regular intervals and
the number of unhalted core cycles and unhalted
reference cycles that have elapsed since the last
iteration are obtained. Frequency is calculated as

\[ F_{current} = \text{Base Operating Frequency} \times \frac{\text{Unhalted Core Cycles}}{\text{Unhalted Reference Cycles}} \]

The frequency calculation is repeated for
each core.
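
The arithmetic in step 3 can be sketched as follows. This is only an illustration of the calculation, not the in-house tool: it takes counter readings as plain integers and does not access MSRs. The function names are hypothetical, and the 48-bit counter width used to handle wraparound is an assumption that should be checked against the processor in use.

```python
def counter_delta(previous: int, current: int, width_bits: int = 48) -> int:
    """Cycles elapsed between two reads of a free-running counter.

    The modulo handles a single wraparound; the 48-bit default width
    is an assumption.
    """
    return (current - previous) % (1 << width_bits)


def estimate_core_frequency_mhz(
    base_ratio: int,
    unhalted_core_cycles: int,
    unhalted_ref_cycles: int,
    bus_clock_mhz: float = 133.33,
) -> float:
    """F_current = base operating frequency * (core cycles / reference cycles)."""
    if unhalted_ref_cycles <= 0:
        raise ValueError("core was halted for the whole interval")
    base_frequency_mhz = base_ratio * bus_clock_mhz
    return base_frequency_mhz * unhalted_core_cycles / unhalted_ref_cycles


# Hypothetical readings: 25 core cycles elapse for every 24 reference cycles,
# so a core with base ratio 24 is running one bus-clock step above base.
core = counter_delta(1_000_000, 3_500_000)
ref = counter_delta(1_000_000, 3_400_000)
print(round(estimate_core_frequency_mhz(24, core, ref), 2))
```

When the two deltas are equal, the estimate is simply the base operating frequency; a ratio above one indicates that the core ran faster than base during the interval.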

For simplicity of analysis, we disabled Simultaneous
Multi Threading (SMT) on the processor, so each
physical core supported only one thread context. We run
the benchmarks with the ondemand and the userspace
frequency governors of Linux. The frequency governor,
and the operating frequency in the case of userspace, are
set prior to starting the benchmark.

The ondemand governor takes system dynamics into
consideration and varies the processor operating frequency.
Consequently, it is the most power-performance
efficient policy, but its dynamism makes for difficult
analysis when the effects of Turbo Boost are also included.
In order to isolate the effects of Turbo Boost,
we use the userspace governor and set the frequency of
all cores to $P_{max}$. With this setup, the OSPM does not
initiate frequency changes despite changes in CPU utilization,
and all transitions in and out of the Turbo Boost
frequencies are the effects of the Turbo Boost algorithm,
without OSPM requests.

Figure 1: tblastx - ondemand governor & default BIOS settings

Figure 2: tblastx - userspace governor & default BIOS settings

The graphs in Figures 1 and 2 show the variation in
frequency and utilization during one run of the tblastx
benchmark from the BLAST suite with the ondemand
and the userspace frequency governors, respectively. In
all the plots, the dashed line captures variation in frequency,
whereas the dot-dash line captures utilization.
The solid horizontal line captures the maximum
non-Turbo Boost frequency (also referred to earlier as the
maximum guaranteed frequency or $P_{max}$).

Interestingly, we see very little variation in the frequency
of the cores despite substantial variations in the
CPU utilization. The ondemand governor does not appear
to be particularly aggressive in pushing down frequencies
of cores that are under low utilization. We do
see, however, that with all four cores active, the lower
Turbo Boost frequency is reached, and the higher Turbo
Boost frequency is reached only when a single core is active.
The userspace governor, as expected, shows much
less variation in core frequency. Except for one instance
(for Core 0), the operating frequency of none of
the cores goes below 3.2 GHz, which we set prior to starting
the benchmark. Even with the userspace governor,
where all cores are set to operate at 3.2 GHz, we
can observe that cores transition into the two Turbo frequencies.

2.1 >2 levels of Turbo 

By default, the processor cores are configured to operate
with a maximum guaranteed frequency of 3.2 GHz.
However, these levels can be changed in the BIOS, and
for our second set of measurements we set the maximum
guaranteed frequency at 2.96 GHz, so that all frequencies
above it (that is, four operating frequencies from 3.05
through 3.4 GHz) are considered as Turbo. We believe
that understanding the behaviour of Turbo in such a setup
is important because a massively multi-core CMP is
likely to have cores that run at a low frequency by default
but also have multiple levels (more than just two)
of Turbo Boost. With multiple applications executing
on the massively multi-core processor, and each processor
core going through various levels of utilization, different
cores could be boosted to different Turbo levels
as time progresses, provided that the power and thermal
constraints (mentioned earlier) are not violated.

Figure 3 captures the variation of core frequency and
CPU utilization. Interestingly, the ondemand governor
appears to become a lot smarter and more aggressive
in reducing the frequency of idle cores. With the custom
BIOS settings, we can expect that applications incur
performance loss because the maximum guaranteed frequency
is lower compared to the default BIOS settings,
and as expected, benchmarks suffered a performance
loss, varying from 5.4% reduction for blastn to 13.1%
reduction for megablast.

Benchmark    % Reduction in Performance
blastn       5.4
blastp       9.6
blastx       12.2
tblastn      5.8
tblastx      13.0
megablast    13.1

Table 3: Reduction in performance with custom BIOS
settings

2.2 Analysis & Observations 

From the figures presented, we can observe that when 
three cores are active, with the default BIOS settings, all 
the cores can simultaneously operate at 3.3 GHz. But 
with the modified BIOS settings, all the cores can oper- 
ate simultaneously at a maximum frequency of 3.2 GHz. 
Consequently, we can infer that the Turbo Boost algo- 
rithm is governed by the limits set in the BIOS. Strictly 
adhering to these limits could be acceptable and neces- 
sary from a reliability and repeatability standpoint. 

However, such a setting could result in the processor
operating at a lower frequency than what is possible
without violating the constraints. From the example
just described, we can see that with the modified BIOS
settings, when 3 cores are active, the cores operate at
3.2 GHz though they could safely operate at 3.3 GHz,
as shown by our first set of experiments. The hard limits
result in sub-optimal frequency assignment and in unrealized
performance. In a massively multicore setup, the
unrealized performance that accrues across all processor
cores could be substantial. Nor is it feasible to manually
set limits on such a processor.

Further, we see instances where, despite the CPU utilization
falling, Turbo Boost increases the frequency of
cores. This behaviour suggests that DVFS requests play
an important role, and if Turbo can be supported, the processor
will do so, even when the result is sub-optimal (higher power
consumption with no performance gain). Therefore, a
scalable, dynamic, and efficient mechanism of frequency
assignment is required.

3 Problem Formulation 

The best performance can be obtained when all active
cores are operating at Turbo frequencies. However, this
frequency assignment is impractical due to power, temperature,
and voltage supply constraints within which
the processor must operate. Let us consider a multicore
processor with $n$ homogeneous cores, that is, cores
that have a common instruction set architecture (ISA),
the same manufacturing process, and the same power-performance
profiles. At any given point of time, $n_s$
cores could be in a sleep state (idle), while the other $n_a$
cores are in an active state.

Figure 3: tblastx - ondemand governor & modified BIOS settings

The processor supports a maximum guaranteed frequency
(denoted $f_0$) and some Turbo Boost frequencies
$f_1, f_2, \ldots, f_m$. At higher operating frequencies, a processor
core consumes higher power, which also results in
higher processor temperature. In the following sections
that discuss the problem formulation, we abandon the
ACPI style nomenclature of frequencies of the form $P_i$
used earlier.

Every processor has a factory-specified maximum
power limit $Pow_{max}$ and a critical shutdown temperature
$T_{crit}$. It is well known that the power consumption of a processor
core is proportional to the square of the supply voltage and to the
frequency (Equation 1):

\[ P \propto V^2 f \tag{1} \]

The current processor temperature
is represented as $T$ (below $T_{crit}$), and each core consumes
power $p_0, p_1, \ldots, p_m$ for the corresponding frequencies.
Two decisions need to be taken: for the OS, the decision
is whether to activate one or more of the
idle cores or to request Turbo Boost on the currently active
cores; for the processor, the decision is to identify the
Turbo Boost frequency (if requested by OSPM) or to ignore
the OSPM request.

We assume that at all times, the processor keeps a
count of the current number of active cores, and a list
of estimated power consumption. For example, consider
a processor with 4 homogeneous cores with two cores
active and two cores in sleep state. The Turbo Boost algorithm
could have a list of estimated power consumption
of the active cores as shown in Table 4. If we stipulate
that all the cores must operate at the same frequency
at any given point in time, this problem can be trivially
solved by performing a table lookup. For instance, if two
cores are active and the maximum allowed power consumption
is 132 W, then we can easily look up the table
and obtain the Turbo frequency as 2.5 GHz.



Active    Estimated power (in W) at Turbo Frequency
Cores     2.3 GHz    2.5 GHz
1         129        131
2         130        132
3         136        138
4         140        142

Table 4: Power estimates of active cores
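
The table lookup for the uniform-frequency case can be sketched in a few lines of Python. The data are the estimates from Table 4; the function name and the choice to return None when no listed Turbo frequency fits the budget are assumptions made for illustration, not a description of the actual Turbo Boost algorithm.

```python
from typing import Dict, Optional

# Estimated power (W) from Table 4: active cores -> {Turbo frequency in GHz: power}.
POWER_ESTIMATES_W: Dict[int, Dict[float, int]] = {
    1: {2.3: 129, 2.5: 131},
    2: {2.3: 130, 2.5: 132},
    3: {2.3: 136, 2.5: 138},
    4: {2.3: 140, 2.5: 142},
}


def lookup_turbo_frequency(active_cores: int, power_limit_w: float) -> Optional[float]:
    """Highest Turbo frequency whose estimated power fits within the limit.

    Returns None if no listed frequency fits (hypothetical convention).
    """
    row = POWER_ESTIMATES_W[active_cores]
    feasible = [freq for freq, power in row.items() if power <= power_limit_w]
    return max(feasible) if feasible else None


print(lookup_turbo_frequency(2, 132))  # the example from the text: 2.5
```

With two active cores and a 132 W limit, this reproduces the 2.5 GHz result from the example above; lowering the limit to 131 W yields 2.3 GHz.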



However, for processors where cores can operate at
different frequencies, the problem is considerably more difficult.
We must choose an assignment of different frequency
values to different cores subject to the power consumption
constraint, such that the processing power is maximized.
We can model this optimization problem as the
following Integer Linear Program (ILP).

Let there be binary variables $x_{ij}$ which take the value
1 when core $i$ is assigned frequency $f_j$. Then, the power
contribution of the core is $x_{ij} \times p_j$ and the performance
contribution is $x_{ij} \times f_j$, and we can formulate the following
maximization problem.

\begin{align}
\text{Maximize} \quad & \sum_{i} \sum_{j} x_{ij} f_j \notag \\
\text{such that} \quad & \sum_{i} \sum_{j} x_{ij} p_j \le Pow_{max} \tag{1a} \\
& \forall i \in n, \quad \sum_{j} x_{ij} \le 1 \tag{1b}
\end{align}


This is a well-known generalization of the Knapsack Problem.
Here, the items (frequencies) are divided into classes (one per core)
and we are allowed to choose at most one item from each
class; this modified problem is a case of the multiple
choice knapsack problem (MCKP). When the objective
function is a constant, that is, we are looking to achieve
a particular performance value (as in our problem context),
then the problem reduces to the subset-sum problem.
Since both the MCKP and subset-sum problems are
NP-Complete, we have little hope of finding an algorithm
to calculate the exact solution in polynomial time.
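
For small instances with integer power values, the ILP can still be solved exactly by dynamic programming over the power budget. The sketch below is one such solver, offered only as an illustration: its running time grows with the numeric value of the budget (pseudo-polynomial), so it does not contradict the hardness result. The per-core power values and the budgets in the example are hypothetical, and frequencies are given in MHz so that all arithmetic stays in integers.

```python
from typing import List, Optional, Tuple


def assign_frequencies_exact(
    n_cores: int,
    freqs_mhz: List[int],
    powers_w: List[int],
    pow_max_w: int,
) -> Tuple[int, List[Optional[int]]]:
    """Maximize the sum of assigned frequencies subject to a total power budget.

    Each core receives at most one frequency index j (constraint 1b) and the
    sum of the chosen p_j must not exceed pow_max_w (constraint 1a).
    Returns (best total frequency, per-core index j or None if unassigned).
    """
    # best[c][b]: max total frequency using the first c cores with power at most b.
    best = [[0] * (pow_max_w + 1)]
    choice: List[List[Optional[int]]] = []
    for _ in range(n_cores):
        prev = best[-1]
        cur = list(prev)  # default: this core is left unassigned
        pick: List[Optional[int]] = [None] * (pow_max_w + 1)
        for budget in range(pow_max_w + 1):
            for j, (f, p) in enumerate(zip(freqs_mhz, powers_w)):
                if p <= budget and prev[budget - p] + f > cur[budget]:
                    cur[budget] = prev[budget - p] + f
                    pick[budget] = j
        best.append(cur)
        choice.append(pick)

    # Walk back through the recorded choices to recover one optimal assignment.
    assignment: List[Optional[int]] = [None] * n_cores
    budget = pow_max_w
    for core in range(n_cores - 1, -1, -1):
        j = choice[core][budget]
        assignment[core] = j
        if j is not None:
            budget -= powers_w[j]
    return best[n_cores][pow_max_w], assignment


# Hypothetical per-core power estimates for three frequency levels.
total, plan = assign_frequencies_exact(4, [3200, 3300, 3400], [30, 34, 39], 130)
print(total, plan)
```

Several assignments can reach the same optimum; the solver returns one of them.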

However, we can devise an approximation algorithm 
that can run in a short time interval and provide the fre- 
quency assignment. Naturally, the frequency assignment 
will be sub-optimal from a performance point of view, 
but we can expect the algorithm to outperform the exist- 
ing algorithm and never violate power and thermal con- 
straints. 
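
One possible shape for such an approximation is a greedy heuristic, sketched below. It is an assumption made for illustration, not the algorithm proposed in this paper: cores are first brought up at the lowest frequency level while the budget allows, and then the single-step upgrade with the best frequency gain per watt is applied repeatedly. The power values are hypothetical and are assumed to increase strictly with frequency. The result never exceeds the budget but is not guaranteed to be optimal.

```python
from typing import List, Optional


def assign_frequencies_greedy(
    n_cores: int,
    freqs_mhz: List[int],
    powers_w: List[int],
    pow_max_w: int,
) -> List[Optional[int]]:
    """Greedy frequency assignment under a total power budget.

    freqs_mhz and powers_w must be sorted in ascending order, with powers_w
    strictly increasing. Returns a per-core level index, or None if the
    core could not be activated within the budget.
    """
    levels: List[Optional[int]] = [None] * n_cores
    used = 0

    # Phase 1: activate as many cores as possible at the lowest level.
    for core in range(n_cores):
        if used + powers_w[0] <= pow_max_w:
            levels[core] = 0
            used += powers_w[0]

    # Phase 2: repeatedly apply the affordable one-step upgrade that gives
    # the largest frequency gain per additional watt.
    while True:
        best = None  # (gain per watt, core index, extra power)
        for core, level in enumerate(levels):
            if level is None or level + 1 >= len(freqs_mhz):
                continue
            extra_power = powers_w[level + 1] - powers_w[level]
            if used + extra_power > pow_max_w:
                continue
            gain = (freqs_mhz[level + 1] - freqs_mhz[level]) / extra_power
            if best is None or gain > best[0]:
                best = (gain, core, extra_power)
        if best is None:
            break
        _, core, extra_power = best
        levels[core] += 1
        used += extra_power
    return levels


print(assign_frequencies_greedy(4, [3200, 3300, 3400], [30, 34, 39], 130))
```

Each pass of the second phase scans all cores once, so the total work is bounded by the number of cores times the number of upgrade steps taken.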

An additional factor that could influence Turbo Boost
is temperature; however, we did not observe a strong
correlation between temperature and Turbo Boost entry/exit,
and therefore we did not incorporate temperature as a
constraint in our problem formulation. Further, the current
formulation does not incorporate task assignment or
task properties. The problem we have presented, however,
is not restricted to frequency assignment across a
large number of cores; it is a problem of frequency assignment
to obtain the maximum performance for all
applications under the various constraints. To that effect,
we can augment the frequency assignment such that
cores are selectively Turbo Boosted depending on the application
that is running on the core: cores running
CPU-bound applications (which see a greater benefit
from increased frequency) are boosted, whereas cores
running memory-bound applications (which do not see
much benefit from frequency boosts) are not
boosted despite requests from OSPM to go into the highest
frequency. Such an approach would require information
to be obtained from the hardware performance counters
on each core.



4 Related Work 

DVFS has been used extensively in studies on
power management and in a variety of
power management mechanisms. The work by Govil
et al. is one of the earliest; they compare
algorithms for setting the CPU frequency. Ishihara et al.
investigate optimal voltage assignment (formulated
as an ILP). However, their approach requires information
such as the number of cycles of each task, the average
switched capacitance per task, etc. These metrics would
have to be gathered offline; consequently, it would not
be possible to use the approach when a completely new
workload is presented.

Dhiman et al. [2] propose a machine learning based approach
that varies dynamic power management policies and
adapts to varying system workloads. Their approach is
a system-level approach and includes peripheral devices
in the power management scheme, specifically focusing
on hard disk and wireless LAN device power management;
CPU power management is not taken into consideration.

In [9], Murali et al. study optimal, temperature-aware
frequency assignment for multi-processor Systems on
Chip (MPSoCs), in which they model and account for
heat flow between parts of the MPSoC and use convex
optimization to obtain the optimal frequency assignment.
In contrast, we focus particularly on the Turbo
Boost feature. Further, their work incorporates a lot of details
from the physics and VLSI design process into the
formulation. This is justified by their goal, which is
to aid chip designers with an estimate of the power consumption
characteristics. Our model instead tries to
do the optimization at the firmware or OS level, which
is after processor fabrication and manufacturing. At this
stage, the theoretical estimates of power and capacitance
are not that useful because, for a particular design or architecture,
the values may differ from gross estimates, and
the vendor-provided thermal ratings are a closer approximation.

The work by Isci et al. comes closest to our work.
Isci et al. investigate per-core and chip-wide DVFS policies
to achieve maximum performance within a given
power budget. They evaluate three local policies: Priority,
PullHiPushLo, and MaxBIPS, and a global power
policy. However, their evaluation considers a processor
that has a small number of operating states; also, they do
not use an ILP formulation. We are interested in understanding
the complexity of scaling a frequency assignment
to a very large number of cores and providing optimal
frequency and performance without violating the
power and thermal constraints, which inherently leads to
the ILP formulation we presented earlier.